Strategies for data normalization and missing data imputation and consequences for potential diagnostic microRNA biomarkers in epithelial ovarian cancer

MicroRNAs (miRNAs) are small non-coding RNA molecules regulating gene expression with diagnostic potential in different diseases, including epithelial ovarian carcinomas (EOC). As only a few studies have been published on the identification of stable endogenous miRNA in EOC, there is no consensus which miRNAs should be used aiming standardization. Currently, U6-snRNA is widely adopted as a normalization control in RT-qPCR when investigating miRNAs in EOC; despite its variable expression across cancers being reported. Therefore, our goal was to compare different missing data and normalization approaches to investigate their impact on the choice of stable endogenous controls and subsequent survival analysis while performing expression analysis of miRNAs by RT-qPCR in most frequent subtype of EOC: high-grade serous carcinoma (HGSC). 40 miRNAs were included based on their potential as stable endogenous controls or as biomarkers in EOC. Following RNA extraction from formalin-fixed paraffin embedded tissues from 63 HGSC patients, RT-qPCR was performed with a custom panel covering 40 target miRNAs and 8 controls. The raw data was analyzed by applying various strategies regarding choosing stable endogenous controls (geNorm, BestKeeper, NormFinder, the comparative ΔCt method and RefFinder), missing data (single/multiple imputation), and normalization (endogenous miRNA controls, U6-snRNA or global mean). Based on our study, we propose hsa-miR-23a-3p and hsa-miR-193a-5p, but not U6-snRNA as endogenous controls in HGSC patients. Our findings are validated in two external cohorts retrieved from the NCBI Gene Expression Omnibus database. We present that the outcome of stability analysis depends on the histological composition of the cohort, and it might suggest unique pattern of miRNA stability profiles for each subtype of EOC. Moreover, our data demonstrates the challenge of miRNA data analysis by presenting various outcomes from normalization and missing data imputation strategies on survival analysis.

Introduction strategies (endogenous miRNA controls, U6-snRNA, and global mean) in terms of their impact on the survival analysis.

Patient cohort
Tumor tissues stored as formalin-fixed and paraffin embedded (FFPE) were acquired from two Danish projects: the Pelvic Mass study (2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014) and the GOVEC (Gynecological Ovarian Vulva Endometrial Cervix cancer) study (2015 -ongoing) through the Bio-and Genome Bank Denmark [27]. The study has been approved by the Danish National Committee for Research Ethics, Capital Region (H-17029749/H-15020061) and was performed according to the guidelines of the Declaration of Helsinki, including written informed consent from all patients. Tissues from a total of 63 patients diagnosed with EOC (International Federation of Gynecology and Obstetrics staging-FIGO stage III/IV) were included in this study. All patients were followed from the surgery date until either death of any cause, emigration or November 11, 2022.

RNA extraction and miRNA profiling by RT-qPCR
Total RNA was extracted from FFPE tissue slices by use of miRNeasy FFPE kit (Qiagen, cat. No. 217504) and subjected to RT-qPCR as described previously [28]. Briefly, two tissue sections with 5 μm thickness were sliced from each paraffin block and immersed in 160 μl deparaffinization solution (Qiagen), followed by steps described by the manufacture's protocol. 1 μl RNA isolation spike-in control mix (UniSp2, UniSp4, and UniSp5) (Qiagen) was added to each sample. Reverse transcription of RNA was accomplished by miRCURY LNA RT Kit (Qiagen, cat. no. 339340) by adding 10 ng of total RNA and 0.5 μl cDNA synthesis spike-in control mix containing UniSp6 and cel-miR-39-3p in 10 μl reaction volumes. RT-qPCR reactions were performed using miRCURY LNA SYBR Green PCR Kit (Qiagen, cat. no. 339347), miR-CURY Custom PCR Panels in a 384-well plate format (Qiagen, cat. no. 339332, design no. YCA33809) and a LightCycler 480 (Roche, Hvidovre, Denmark).
The miRCURY Custom PCR panels contained 48 locked nucleic acid (LNA) PCR assays for detection of 1) 40 miRNAs selected based on their stability or biomarker potential, 2) the RNA isolation spike-in controls added at the beginning of each isolation procedure (UniSp2, UniSp4, and UniSp5), 3) the cDNA synthesis spike-in controls (UniSp6 and cel-miR-39-3p), 4) UniSp3, the interplate calibrator (UniSp3_IPC), to identify between runs variations, 5) U6-snRNA which is commonly used as the endogenous control, and 6) blank spot to control unwanted RT-qPCR outcomes. A design of a customized plate can be found in S1 Fig. RT-qPCR reactions were prepared according to the manufacture's protocol. Briefly, a large pool containing the 2x miRCURY SYBR Green Master Mix and RNase free water was prepared, and cDNA was added at a ratio of 1:100 of a total volume. The mixture was then homogenized thoroughly and briefly centrifuged before distributing 10 μl into individual PCR wells. The LightCycler1480 software version 1.5 (Roche) and absolute quantification analysis/2 nd derivative maximum method with high confidence setting was employed to determine the baseline and the Crossing points (Cps) of the amplification curves for each run. The customized panels were subjected to RT-qPCR experiments and raw data was collected from 10 panels (8 patients per panel) (S2 Table). After interplate calibration with UniSp3_IPC, the isolation spike-in controls (UniSp2, UniSp4, and UniSp5) and the cDNA synthesis controls (UniSp6 and cel-miR-39-3p) were then used to evaluate the efficiency of the process and exclude any samples due to questionable quality, described according to previously published protocol [28]. The isolation spike-in controls are mixed such that UniSp2 is present at a concentration 100-fold higher than UniSp4, and UniSp4 is present at a concentration 100-fold higher than UniSp-5. Therefore, approximately 6.6 cycles difference is expected between UniSp4 and UniSp2, as well as between UniSp5 and UniSp4. As the concentration of UniSp5 reflects weakly expressed miR-NAs, its detection might not always be possible. Moreover, since the input RNA amount for cDNA synthesis had to be adjusted to 10 ng in total, the cDNA samples were prepared with different RNA dilution ratios, as the concentrations of isolated RNA varied widely, which led to relatively high variation in observed UniSp2 and UniSp4 levels. Therefore, all samples with UniSp2 Cp values above 30 and outside 5-8 Cp difference range between UniSp4 and UniSp2 were excluded from further analysis (S2 Table).

Data handling and analysis
All data analyses were performed using R Statistical Software (version 4.

Endogenous control selection
In order to identify stable endogenous controls, only miRNAs without missing values and Cp values below 35 (complete cases) were included. The stability of the miRNA investigated was assessed by utilizing online expression stability tool RefFinder [30] implemented into R software by use of R package: RefSeeker developed by our group. The tool enables the identification of stably expressed candidates by several well-known algorithms; BestKeeper [31], NormFinder [32], geNorm [19] and the comparative delta-Ct method [33]. Moreover, RefFinder assigns a weight to the ranking of the four statistical algorithms and calculate a geometric mean to determine an overall ranking of the analyzed genes.

Validation in external datasets
We retrieved data from two independent datasets: GSE81873 [34] and GSE43867 [35] by use of the NCBI Gene Expression Omnibus database. For detailed information regarding biospecimen collection, clinical data, and sample processing, we refer to the original publications for each dataset. We employed the R package miRBaseConverter to unify miRNA annotation from two datasets to the latest miRBase version (version 22) [36]. MiRNA entries not presented in the latest version of the miRBase database were excluded from further analysis. Only targets (miRNAs and controls) without missing values across entire dataset and Cp values below 35 were included to perform stability analysis as described in the section"Endogenous control selection".

Data normalization
Three normalization methods were applied to one complete_cases and four datasets from missing data imputation step (Fig 1):

Missing data
The pattern for incomplete dataset was investigated by use of two R packages: "finalfit" [37] and "ggmice" [38]. The missingness of a single dependent variable-a specific miRNA with missing values was tested against explanatory variables individually: overall survival (OS), OS status, age at diagnosis, histological subtype, and a customized panel number as multiple univariate analyses. Continuous data were compared with a Kruskal Wallis test, whereas discrete data are compared with a chi-squared test [37]. In order to identify stable endogenous controls, only miRNAs without missing values were included. To investigate possible impact of different imputation methods, Cp values equal to or above 35 were marked as missing values Only miRNAs without missing values (complete_cases) were included to assess their stability with five algorithms: the comparative delta-Ct method (deltaCt), BestKeeper, NormFinder, geNorm, and RefFinder. Our findings were further validated in external cohorts. Single or multiple imputation methods were applied on miRNAs with missing value ratio lower than 20%. Three normalization strategies were used individually on different input datasets. Finally, survival analysis by Cox regression was performed on fifteen differently processed datasets originated from the same RT-qPCR raw dataset. https://doi.org/10.1371/journal.pone.0282576.g001

PLOS ONE
("NA"). The imputation methods were further applied on miRNAs with missing value ratio lower than 20%. To fill missing values, simple and multiple imputation approaches were applied. As simple imputation methods, we tested two approaches: replacing missing values with Cp = 35 (imp_na_35) or replacing missing values by "highest Cp value for investigated miRNA + 1" (imp_max_one) [39]. To fill missing values by multiple imputation, 3 strategies were employed: • nonparametric missing value imputation using random forest enabled by an R package "missForest" [40] (imp_missF) • k-nearest neighbour imputation with an R package "VIM" [41] (imp_VIM).
The coefficient of variation for individual miRNAs (CoV_ind) was calculated as the ratio between standard deviation and mean of the Cp value of this miRNA across all samples in 4 differently imputed datasets.
To evaluate performance of nine endogenous candidates, their expression was measured by RT-qPCR along the expression of miRNAs that have been reported as potential stable candidate biomarkers in OC ( Table 1).

Identification of endogenous miRNAs controls
Only miRNAs without missing values (complete cases, Table 2) were included to assess their stability with five well-known algorithms; BestKeeper (BestK) [31], NormFinder (NormF) [32], geNorm [19], the comparative delta-Ct method (deltaCt) [33] and RefFinder [30] ( Table 3). The latter algorithm assigns a weight to the ranking of the former ones and calculates a geometric mean to determine an overall ranking of the analyzed miRNAs. The differences between rankings from various algorithms can be seen, for example hsa-miR-23a-3p is ranked as most stable by deltaCt, NormF, geNorm and RefFinder, but according to the BestK is ranked at the 8 th position. U6-snRNA, commonly used control is not among top most stable candidates for various algorithms.

Validation in external datasets
Our results were validated using two external datasets retrieved from the NCBI Gene Expression Omnibus database: GSE81873 and GSE43867 (Table 4). After filtering miRNAs with missing values and values above 35, there were 81 and 162 targets, respectively in the cohorts GSE81873 and GSE43867, included for the stability analysis. Rankings for miRNAs shared between cohorts are presented in Tables 5 and 6, whereas full rankings for both datasets can be

PLOS ONE
found in the S3 Table. The differences between rankings were observed, for example hsa-miR-193a-5p was ranked among top candidates for our cohort (RefFinder rank 2/21) and GSE81873 dataset (RefFinder rank 10/89), but not for the GSE43876 cohort (RefFinder rank 72/162), when only HGSC patients were included. U6-snRNA was ranked as 12 in our dataset, 45 in GSE81873, and 161 in GSE43876 according to RefFinder stability rank. The differences between various algorithms can be seen and are more pronounced with higher number of miRNAs being included in the analysis. In order to evaluate the composition of the cohort on the choice of endogenous controls, we performed stability analysis in different subgroups in the external cohorts and presented rankings for chosen miRNAs in Table 6.

Imputation of missing data
After adjusting the raw RT-qPCR data from 10 panels (8 patients per panel) by the interplate calibrator, the pattern of missingness of an incomplete dataset was investigated ( Table 2). The imputation methods were further applied on miRNAs with missing value ratio lower than 20%, resulting in excluding hsa-miR-1183, hsa-miR-135a-3p, hsa-miR-149-3p, hsa-miR-23a-5p, hsa-miR-27a-3p, hsa-miR-27a-5p, hsa-miR-302d-3p, hsa-miR-506-3p, hsa-miR-595, hsa-miR-802 and hsa-miR-92b-5p. In the next step, the missingness pattern of a specific miRNA with missing values was tested against explanatory variables: OS, OS status (dead or alive), age at diagnosis, and a customized panel number, but no dependence has been observed. In order to fill missing values, simple (na_max_one and na_35) and multiple imputation (missF, VIM) approaches were applied and their effects on the CoV values on individual

PLOS ONE
miRNAs are presented in Table 7. The results show that the CoV values for individual miRNAs are not affected by multiple imputation methods (missF and VIM), however single imputation methods lead to higher CoVs when comparing to CoVs from the raw dataset (data_raw). For hsa-miR-126-3p and hsa-miR-1234-3p, which had many missing values, the CoVs are approximately 1.6 times higher than CoVs from the raw dataset.
In the next step, three different methods of normalization (ENDO, U6-snRNA, GMean) were performed on complete cases dataset and on each imputed dataset, which resulted in

Discussion
With this study, hsa-miR-23a-3p, and hsa-miR-193a-5p were identified as most stable in a cohort of 48 patients with HGSC. Interestingly, these miRNAs were included in our study because they were reported previously as potential biomarkers in EOC (Table 1) [60] to discriminate between type I and II tumours (hsa-miR-193a-5p) and to predict progression free survival (hsa-miR-23a-3p) pointing towards an explanation of the very different effects of miRNA observed in cohorts with different composition of EOC patients. Moreover, as we showed in Table 6, the pattern of stability for a particular miRNA (increased or decreased stability rank) is dependent on the histological composition of the cohort, but also varies between different miRNAs. For example, hsa-miR-193a-5p stability is increasing, when there is only one subtype of EOC present (HGSC) versus a cohort with SC and benign cases included (GSE81873), however, opposite effect is observed for hsa-miR-191-5p. In the study that used the cohort GSE43867, two miRNAs were used as endogenous stable controls: hsa-miR-16-5p and hsa-miR-191-5p, however how they were chosen is not precisely described. According to our stability analysis, these miRNAs were not among top ten most stable targets for any of the algorithms and according to RefFinder they were ranked on the position 37 -hsa-miR-191-5p and on the position 121 -hsa-miR-16-5p (S3 Table). In case of GSE81873, the top 10 micro-RNAs obtained from the geNorm algorithm were used to normalize data. However, the names of these miRNAs are not mentioned in the description, therefore it would be difficult to compare it with our analysis. Nevertheless, different stability algorithms selected various top 10  Table 3   Table 5  Table 5   Table 6 https://doi.org/10.1371/journal.pone.0282576.t004

PLOS ONE
miRNAs for this dataset according to our stability analysis (S3 Table). Therefore, in general more detailed description of various data processing steps would be recommended to enable validation of the results across different research centers. U6-snRNA that is commonly used as the endogenous control in OC studies [61][62][63] was not in the top 10 of most stable candidates for three algorithms: deltaCt, NormFinder, geNorm, and RefFinder (rank 14 or 15 dependently on the imputation method- Table 3). Similar findings were made in external validation cohorts while considering HGSC patients, in which

PLOS ONE
U6-snRNA was ranked as 45 th among 81 targets in GSE81873, whereas in GSE43867 was on the position 161 among 162 targets considered (Table 5). These observations indicate that U6-snRNA is not a suitable endogenous candidate for normalization in EOC, which is in line with some previous reports showing high inter-individual variances and expression instability of U6-snRNA in cancers [20,21]. Our results emphasize that it is crucial to select suitable endogenous controls to ensure the reliability of RT-qPCR, especially when working with miR-NAs in a clinical context in a complex, heterogenous diseases such as EOC, as the conclusions that are drawn might be dependent on the composition of the cohort and might impact clinical decisions.
In our previous study, we collected miRNA-microarray data from four datasets: the inhouse "Pelvic Mass", and three public datasets with primary EOC patients: The Cancer Genome Atlas, GSE47841, and GSE73581 in order to find endogenous control candidates [22]. We found that two miRNAs: hsa-miR-106b-3p and hsa-miR-92b-5p were among the top 100 candidates for all datasets, when considering only miRNAs mutual for all datasets. Their stability was not evaluated in this study, as none of them were available as a complete case (hsa-miR-106b-3p -2.1% of missing data, hsa-miR-92b-5p -33.3% of missing data).
Moreover, we furthermore examined the impact of various data handling on survival analysis in 48 HGSC patients. Each miRNA in each of 15 differently processed datasets (Fig 1) was subjected to univariate Cox regression analysis and miRNA being predictors of OS (hsa-miR-126-3p and hsa-miR-1301-3p) were only found in the datasets that included miRNAs with missing values (Table 8). Interestingly, these miRNAs were included because of their biomarker potential: hsa-miR-126-3p was associated with OS in EOC and hsa-miR-1301 was shown to be involved in cisplatin resistance (Table 1). However, these miRNAs were not complete cases ( Table 2) and further investigation is required to understand their role in EOC. As shown in Table 7, the CoV values for individual miRNAs were mainly affected by single imputation methods, whereas subtle differences were observed when comparing datasets filled by multiple imputation methods with the raw dataset. This is not surprising, as imputed values are estimated from the detected Cp values. However, such imputation might assign Cp values to some miRNAs that were in fact not detectable because of a biological reason (no target in the sample or very low concentration) not a technical failure and might lead to false conclusions [39]. Therefore, one might consider increasing the number of biological/technical replicates per sample in order to decrease the number of miRNAs with missing data or to confirm their missingness because of biological reasons, such as regulatory loops of other genes expression depending on the biological context of each EOC subtype. Moreover, the sample age and type could be additionally included as a potential factor that could impact miRNA studies.

PLOS ONE
Nonetheless, previous studies have showed a good correlation of miRNA between fresh-frozen and FFPE when including several hundred microRNAs [64,65].

Conclusions
Our study demonstrates the need of awareness in the choice of RT-qPCR missing data handling and data normalization approaches in miRNA biomarker studies. We suggest that other endogenous controls than U6-snRNA, which seems to be unstable in various EOC cohorts, should be considered and we identified hsa-miR-23a-3p and hsa-miR-193a-5p among top candidates for HGSC patients. We presented that the pattern of miRNA stability depends on the histological composition of the cohort and it might suggest that each subtype of EOC might be characterized by its unique miRNAs expression pattern. Moreover, in order to achieve consensus on miRNA-related studies in EOC, future studies regarding miRNA data analysis are required, as e.g., many studies do not precisely describe how handling of missing data was performed, which can lead to biased results and hinder the validation efforts.